This session provides workshop attendees with hands-on experience building a natural language processing (NLP) pipeline and running a simple experiment. While we will provide the code needed to run the experiment, we will also provide links to additional coding resources, and we strongly encourage attendees to expand and build on this code using these links as well as their own resources. To make this task fun as well as educational, we will build on and expand some Data Camp tutorials. The goal of these tutorials is to provide users with an overview of how to build simple NLP pipelines and apply simple text mining techniques within the R tidy framework. Today we will examine songs by the artist Prince.
Before we can build our pipeline, we need to set up our environment by loading the R libraries containing the functions we will need. I have also created plotting functions, set color palettes and themes, and set style parameters for tables.
## Set-Up Environment
# load needed libraries
library(circlize) #chord diagram
library(caret)
library(corpustools)
library(dplyr) #data manipulation
library(doMC) # Library for parallel processing
library(e1071)
library(easyGgplot2)
library(ggraph) #ngram network diagrams
library(ggplot2) #visualizations
library(ggrepel) #`geom_label_repel`
library(gridExtra) #viewing multiple plots together
library(formattable) #for the color_tile function
library(igraph) #ngram network diagrams
library(kableExtra) #create a nicely formatted HTML table
library(knitr) #for dynamic reporting
library(quanteda) #text processing functionality
library(RCurl) #loading data from url
library(RTextTools)
library(spacyr) #very powerful text mining library
library(SnowballC) #for stemming
library(tidyr) #Spread, separate, unite
library(tidytext) #text mining
library(tm) # general text mining functions, making document term matrices
library(topicmodels)
library(widyr) #Use for pairwise correlation
library(wordcloud2) #creative visualizations
library(yarrr) #Pirate plot
# set-up custom color themes to use throughout
my_colors <- c("#44AF69", "#F8333C", "#FCAB10", "#2B9EB3", "#DBB6B6")
# create a general plotting function
theme_lyrics <- function()
{
theme(plot.title = ggplot2::element_text(hjust = 0.5),
axis.text.x = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank(),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
legend.position = "none")
}
# customize the text tables for consistency using HTML formatting
my_kable_styling <- function(dat, caption) {
kableExtra::kable(dat, "html", escape = FALSE, caption = caption) %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
full_width = FALSE)
}
Before we can build our pipeline we need data to analyze. For this workshop, we will be analyzing data scraped from the Billboard Charts. The data consists of 824 songs written by Prince and for each song we have:
* Lyrics: the lyrics for each song (character)
* Song: the song title (character)
* Year: the year the song was released (integer)
* Album: the name of the album (character)
* Peak: the highest spot on the charts the song reached (integer)
* Genres: several columns of genres for different countries (character)
The data can be accessed via the ‘Billboard Charts’ hyperlink above, or it can be downloaded directly from the Web. Below, I have provided code that uses the RCurl library to download the file and read it into an R data.frame. Once we have created the data.frame, we can examine the data and remove columns that we know we will not use for analysis; for this pipeline, we can remove most of the genre columns. To peek at the structure of the data, we can use dplyr::glimpse.
## Load Data ====
# access data via URL
myfile <- RCurl::getURL('https://s3.amazonaws.com/assets.datacamp.com/blog_assets/prince_raw_data.csv',
ssl.verifyhost=FALSE,
ssl.verifypeer=FALSE)
# read data into data frame
data <- read.csv(textConnection(myfile),
header=T,
stringsAsFactors = FALSE)
# remove unwanted columns
data <- data %>% dplyr::select(lyrics = text, song, year, album, peak, us_pop = US.Pop, us_rnb = US.R.B)
# take a peek at the first row data
dplyr::glimpse(data[1,])
## Observations: 1
## Variables: 7
## $ lyrics <chr> "All 7 and we'll watch them fall\nThey stand in the way...
## $ song <chr> "7"
## $ year <int> 1992
## $ album <chr> "Symbol"
## $ peak <int> 3
## $ us_pop <chr> "7"
## $ us_rnb <chr> "61"
Once the data is loaded, we need to prepare it for analysis. This process differs widely between investigators but usually includes the same core set of tasks. Here, we will clean our data by expanding contractions (e.g., won’t becomes will not and can’t becomes can not), removing special characters (e.g., punctuation), and converting all of the text to lower case.
## Clean Data ====
# Contractions
fix.contractions <- function(doc) {
# "won't" is a special case as it does not expand to "wo not"
doc <- gsub("won't", "will not", doc)
doc <- gsub("can't", "can not", doc)
doc <- gsub("n't", " not", doc)
doc <- gsub("'ll", " will", doc)
doc <- gsub("'re", " are", doc)
doc <- gsub("'ve", " have", doc)
doc <- gsub("'m", " am", doc)
doc <- gsub("'d", " would", doc)
# 's could be 'is' or could be possessive: it has no expansion
doc <- gsub("'s", "", doc)
return(doc)
}
# fix (expand) contractions
data$lyrics <- sapply(data$lyrics, fix.contractions)
# Special Characters
# remove special characters
data$lyrics <- sapply(data$lyrics, function(x) gsub("[^a-zA-Z0-9 ]", " ", x))
# Case
# convert everything to lower case
data$lyrics <- sapply(data$lyrics, tolower)
# preview the cleaned data
data$lyrics[1]
## [1] "all 7 and we will watch them fall they stand in the way of love and we will smoke them all with an intellect and a savior faire no one in the whole universe will ever compare i am yours now and you are mine and together we will love through all space and time so do not cry one day all 7 will die all 7 and we will watch them fall they stand in the way of love and we will smoke them all with an intellect and a savior faire no one in the whole universe will ever compare i am yours now and you are mine and together we will love through all space and time so do not cry one day all 7 will die and i saw an angel come down unto me in her hand she holds the very key words of compassion words of peace and in the distance an army marching feet 1 2 3 4 1 2 3 4 but behold we will watch them fall and we lay down on the sand of the sea and before us animosity will stand and decree that we speak not of love only blasphemy and in the distance 6 others will curse me but that alright that alright 4 i will watch them fall 1 2 3 4 5 6 7 all 7 and we will watch them fall they stand in the way of love and we will smoke them all with an intellect and a savior faire no one in the whole universe will ever compare i am yours now and you are mine and together we will love through all space and time so do not cry one day all 7 will die just how old and we will see a plague and a river of blood and every evil soul will surely die in spite of their 7 tears but do not fear 4 in the distance 12 souls from now you and me will still be here we will still be here there will be a new city with streets of gold the young so educated they never grow old and a there will be no death for with every breath the voice of many colors sings a song that so bold sing it while we watch them fall all 7 and we will watch them fall they stand in the way of love and we will smoke them all with an intellect and a savior faire no one in the whole universe will ever compare i am yours now and you are mine and 
together we will love through all space and time so do not cry one day all 7 will die just how old just how old just how old"
More often than not, you will find that a meaningful analysis requires adding some information to your data. Here, we can use the existing data to create three new variables:
* decade (1970s to 2010s)
* chart_level (Top 10, Top 50, Top 100, and Uncharted)
* charted (Charted, Uncharted)
By creating these variables, we provide additional avenues to investigate the data. Once we have added these variables to the original data, we save a copy of it so we can reference it at a later time.
## Create Additional Features ====
# create a variable to store years as decades
data <- data %>% dplyr::mutate(decade =
ifelse(data$year %in% 1978:1979, "1970s",
ifelse(data$year %in% 1980:1989, "1980s",
ifelse(data$year %in% 1990:1999, "1990s",
ifelse(data$year %in% 2000:2009, "2000s",
ifelse(data$year %in% 2010:2015, "2010s",
"NA"))))))
# create a variable for chart level
data <- data %>% dplyr::mutate(chart_level =
ifelse(data$peak %in% 1:10, "Top 10",
ifelse(data$peak %in% 11:50, "Top 50",
ifelse(data$peak %in% 51:100, "Top 100",
"Uncharted"))))
# create binary variable to indicate if a song hit the Billboard Charts
data <- data %>% dplyr::mutate(charted = ifelse(data$peak %in% 1:100, "Charted", "Uncharted"))
# save the dataset with newly added features to .csv
write.csv(data, file = "Data/prince_new.csv")
The final step before analyzing the data is to identify words within the songs that add unneeded noise. In text mining, these are often called stop words, and it is in our best interest to remove as many of them as possible prior to analysis. There are many ways to go about this, including creating your own list of words (as shown below in undesirable_words). You can also use lists available from R libraries such as tidytext::stop_words and quanteda::stopwords("english"). A sample of the words included in tidytext::stop_words is printed below. Our last step is to remove words of length 3 or less.
## Tokenization ====
# create a custom list of meaningless words to remove
undesirable_words <- c("prince", "chorus", "repeat", "lyrics",
"theres", "bridge", "fe0f", "yeah", "baby",
"alright", "wanna", "gonna", "chorus", "verse",
"whoa", "gotta", "make", "miscellaneous", "2",
"4", "ooh", "uurh", "pheromone", "poompoom", "3121",
"matic", " ai ", " ca ", " la ", "hey", " na ",
" da ", " uh ", " tin ", " ll", "transcription",
"repeats")
# view the tidytext list of stop words
sample(tidytext::stop_words$word, 15)
## [1] "here's" "because" "today" "seconds" "yet"
## [6] "very" "off" "rooms" "facts" "via"
## [11] "presenting" "furthers" "other" "mustn't" "showing"
# unnest and remove stop, undesirable and short words
prince_words_filtered <- data %>%
tidytext::unnest_tokens(word, lyrics) %>%
dplyr::anti_join(tidytext::stop_words) %>%
dplyr::distinct() %>%
dplyr::filter(!word %in% undesirable_words) %>%
dplyr::filter(nchar(word) > 3)
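For intuition, the unnest-and-filter steps above amount to roughly the following base-R sketch (a toy illustration on one hypothetical line, not the tidytext implementation):

```r
# Toy sketch: tokenize a line, drop short words, then drop stop words.
tokenize_line <- function(text, stop_words) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))  # split on non-alphanumerics
  tokens <- tokens[nchar(tokens) > 3]                      # keep words longer than 3 chars
  tokens[!tokens %in% stop_words]                          # remove stop words
}
tokenize_line("They stand in the way of love", c("they", "stand"))  # "love"
```

tidytext::unnest_tokens does considerably more (it also carries along the song-level columns, which is what lets us filter and facet by decade and chart level later), but the core idea is the same.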
# view the tokenized, unsummarized, tidy data structure
prince_words_filtered %>%
dplyr::filter(word == "race") %>%
dplyr::select(word, song, year, peak, decade, chart_level, charted) %>%
dplyr::arrange(song) %>%
dplyr::top_n(10, song) %>%
dplyr::mutate(song = formattable::color_tile("plum","plum")(song)) %>%
dplyr::mutate(word = formattable::color_tile("aquamarine","aquamarine")(word)) %>%
my_kable_styling(caption = "Tokenized Format Example")
The table below provides an example of what the cleaned, tokenized data looks like for the top 10 songs that include the word race.
| word | song | year | peak | decade | chart_level | charted |
|---|---|---|---|---|---|---|
| race | lovesexy | 1988 | 1 | 1980s | Top 10 | Charted |
| race | my tree | NA | NA | NA | Uncharted | Uncharted |
| race | positivity | 1988 | NA | 1980s | Uncharted | Uncharted |
| race | race | 1994 | NA | 1990s | Uncharted | Uncharted |
| race | sexuality | 1981 | 88 | 1980s | Top 100 | Charted |
| race | slow love | 1987 | NA | 1980s | Uncharted | Uncharted |
| race | the rest of my life | 1999 | NA | 1990s | Uncharted | Uncharted |
| race | the undertaker | NA | NA | NA | Uncharted | Uncharted |
| race | u make my sun shine | NA | NA | NA | Uncharted | Uncharted |
| race | welcome 2 the rat race | NA | NA | NA | Uncharted | Uncharted |
Now that we have cleaned the data, we want to explore how the variables we created interact with the words within the songs. To begin, we graph the counts of songs over time and for each chart level. As shown in the figure below, most of the songs that Prince wrote never reached the Billboard charts, and Prince released the greatest number of songs in the 1990s.
#look at the full data set at your disposal
data %>%
dplyr::filter(decade != "NA") %>%
dplyr::group_by(decade, chart_level) %>%
dplyr::summarise(number_of_songs = n()) %>%
ggplot() +
ggplot2::geom_bar(aes(x = decade, y = number_of_songs,
fill = chart_level), stat = "identity") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5),
legend.title = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank()) +
ggplot2::labs(x = NULL, y = "Song Count") +
ggplot2::ggtitle("Charted Songs Over Time") +
ggplot2::coord_flip()
Counts of Songs by Decade
We can visualize the same data using a chord diagram, which arranges the categories around a circle and uses colored ribbons to connect associated categories. In this figure, the ribbon from the 1990s is the thickest, and it connects to Uncharted, so we can draw the same conclusions as we did from the prior plot.
decade_chart <- data %>%
dplyr::filter(decade != "NA") %>% # remove songs without release dates
dplyr::count(decade, charted) #Get SONG count per chart level per decade. Order determines top or bottom
# reset the circular layout parameters!
circlize::circos.clear()
grid.col = c("1970s" = my_colors[1], "1980s" = my_colors[2], "1990s" = my_colors[3], "2000s" = my_colors[4], "2010s" = my_colors[5], "Charted" = "grey", "Uncharted" = "grey") #assign chord colors
# set the global parameters for the circular layout. Specifically the gap size
circlize::circos.par(gap.after = c(rep(5, length(unique(decade_chart[[1]])) - 1), 15,
rep(5, length(unique(decade_chart[[2]])) - 1), 15))
# render plot
circlize::chordDiagram(decade_chart, grid.col = grid.col, transparency = .2)
title("Relationship Between Chart Songs Over Time")
Songs that Hit No. 1 on the Billboard Charts
Here, we visualize Prince’s songs that hit number 1 on the Billboard Charts by the release year.
# songs that hit No. 1 on the charts
data %>%
dplyr::filter(peak == "1") %>%
dplyr::select(year, song, peak) %>%
dplyr::arrange(year) %>%
dplyr::mutate(year = formattable::color_tile("plum1", "plum1")(year)) %>%
dplyr::mutate(peak = formattable::color_tile("aquamarine", "aquamarine")(peak)) %>%
my_kable_styling(caption = "Prince's No. 1 Songs")
| year | song | peak |
|---|---|---|
| 1979 | i wanna be your lover | 1 |
| 1984 | erotic city | 1 |
| 1984 | purple rain | 1 |
| 1984 | when doves cry | 1 |
| 1985 | around the world in a day | 1 |
| 1986 | kiss | 1 |
| 1988 | lovesexy | 1 |
| 1989 | batdance | 1 |
| 1990 | thieves in the temple | 1 |
| 1991 | diamonds and pearls | 1 |
| 1995 | the most beautiful girl in the world | 1 |
| 2006 | 3121 | 1 |
| 2007 | planet earth | 1 |
Word Frequency by Song and Chart Level
Here, we explore word counts by song along with the level of the Billboard Charts each song reached. As shown in the table below, the majority of the 20 songs with the highest word counts never reached the Billboard charts. Only one song, “My Name is Prince”, a Top 10 Billboard Chart song containing 916 words, is included among the top 20 songs with the greatest word counts.
## Word Frequency
full_word_count <- data %>%
tidytext::unnest_tokens(word, lyrics) %>%
dplyr::group_by(song,chart_level) %>%
dplyr::summarise(num_words = n()) %>%
dplyr::arrange(desc(num_words))
full_word_count[1:20,] %>%
dplyr::ungroup(num_words, song) %>%
dplyr::mutate(num_words = formattable::color_bar("aquamarine")(num_words)) %>%
dplyr::mutate(song = formattable::color_tile("plum","plum")(song)) %>%
my_kable_styling(caption = "20 Songs With Highest Word Count")
| song | chart_level | num_words |
|---|---|---|
| johnny | Uncharted | 1349 |
| cloreen bacon skin | Uncharted | 1263 |
| push it up | Uncharted | 1240 |
| the exodus has begun | Uncharted | 1072 |
| wild and loose | Uncharted | 1031 |
| jughead | Uncharted | 940 |
| my name is prince | Top 10 | 916 |
| acknowledge me | Uncharted | 913 |
| the walk | Uncharted | 883 |
| the purple medley | Uncharted | 874 |
| extra lovable | Uncharted | 868 |
| segue vi | Uncharted | 864 |
| xtralovable | Uncharted | 862 |
| push | Uncharted | 852 |
| soul psychodelicide 2 | Uncharted | 833 |
| black mf in the house | Uncharted | 825 |
| i rock therefore i am | Uncharted | 822 |
| now | Uncharted | 801 |
| u gotta shake something | Uncharted | 791 |
| do yourself a favor | Uncharted | 779 |
Word Count Distribution
Here we view the distributions of word counts by chart level. From the figure below, we can see that the distributions, regardless of chart level, are fairly right skewed and that the majority of songs have a word count of less than 500.
# visualize distribution of word counts
full_word_count %>%
easyGgplot2::ggplot2.histogram(xName='num_words', groupName='chart_level',
alpha=0.5, position="stack") +
ggplot2::ylab("Song Count") +
ggplot2::xlab("Word Count per Song") +
ggplot2::ggtitle("Word Count Distribution") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5),
legend.title = ggplot2::element_blank(),
panel.grid.minor.y = ggplot2::element_blank())
Most Frequent Words
The figure below illustrates the most frequently used words across all of Prince’s songs. As shown, the top three most frequently used words include “love”, “time”, and “girl”.
# top words
prince_words_filtered %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::top_n(10) %>%
dplyr::ungroup() %>%
dplyr::mutate(word = reorder(word, n)) %>%
ggplot() +
ggplot2::geom_col(aes(word, n), fill = my_colors[4]) +
ggplot2::theme(legend.position = "none",
plot.title = ggplot2::element_text(hjust = 0.5),
panel.grid.major = ggplot2::element_blank()) +
ggplot2::xlab("") +
ggplot2::ylab("Song Count") +
ggplot2::ggtitle("Most Frequently Used Words in Prince Lyrics") +
ggplot2::coord_flip()
Word Cloud: Top 300 Most Frequent Words
Word clouds can be really useful tools for visualizing the most frequent words within a corpus. In these images, the size of a word is directly related to the frequency with which it occurs. As shown above, the most frequent words are “love”, “time”, and “girl”.
# word cloud
prince_words_counts <- prince_words_filtered %>% dplyr::count(word, sort = TRUE)
wordcloud2::wordcloud2(prince_words_counts[1:300, ],
size = .5,
shape = 'circle',
fontFamily = "Arial Narrow")
Word Popularity by Billboard Chart Level:
The faceted plots below illustrate the counts of the top 8 words by chart level. As shown below, we can see that “love”, “time”, and “money” are the most popular words regardless of chart level.
# Popular words - group by chart level
popular_words <- prince_words_filtered %>%
dplyr::group_by(chart_level) %>%
dplyr::count(word, chart_level, sort = TRUE) %>%
dplyr::slice(seq_len(8)) %>%
dplyr::ungroup() %>%
dplyr::arrange(chart_level,n) %>%
dplyr::mutate(row = row_number())
# generate plot
popular_words %>%
ggplot(aes(row, n, fill = chart_level)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::labs(x = NULL, y = "Song Count") +
ggplot2::ggtitle("Popular Words by Chart Level") +
theme_lyrics() +
ggplot2::facet_wrap(~chart_level, scales = "free") +
ggplot2::scale_x_continuous( # This handles replacement of row
breaks = popular_words$row, # notice need to reuse data frame
labels = popular_words$word) +
ggplot2::coord_flip()
Word Popularity by Decade:
The faceted plots below illustrate the counts of the top 8 words by decade. As shown below, we can see that “love” is among the most popular words in every decade.
timeless_words <- prince_words_filtered %>%
dplyr::filter(decade != 'NA') %>%
dplyr::group_by(decade) %>%
dplyr::count(word, decade, sort = TRUE) %>%
dplyr::slice(seq_len(8)) %>%
dplyr::ungroup() %>%
dplyr::arrange(decade,n) %>%
dplyr::mutate(row = row_number())
# generate plot
timeless_words %>%
ggplot2::ggplot(aes(row, n, fill = decade)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::labs(x = NULL, y = "Song Count") +
ggplot2::ggtitle("Timeless Words") +
theme_lyrics() +
ggplot2::facet_wrap(~decade, scales = "free", ncol = 5) +
ggplot2::scale_x_continuous( # This handles replacement of row
breaks = timeless_words$row, # notice need to reuse data frame
labels = timeless_words$word) +
ggplot2::coord_flip()
Word Length Distribution:
The plot below shows the distribution of word length. As shown in the figure, most words in the corpus have a length of less than five characters.
#unnest and remove undesirable words, but leave in stop and short words
prince_word_lengths <- data %>%
tidytext::unnest_tokens(word, lyrics) %>%
dplyr::group_by(song,decade) %>%
dplyr::distinct() %>%
dplyr::filter(!word %in% undesirable_words) %>%
dplyr::mutate(word_length = nchar(word))
# generate plot
prince_word_lengths %>%
dplyr::count(word_length, sort = TRUE) %>%
ggplot2::ggplot(aes(word_length)) +
ggplot2::geom_histogram(aes(fill = ..count..),
breaks = seq(1, 25, by = 2),
show.legend = FALSE,
colour = "dodgerblue") +
ggplot2::xlab("Word Length") +
ggplot2::ylab("Word Count") +
ggplot2::ggtitle("Word Length Distribution") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5),
panel.grid.minor = ggplot2::element_blank())
Word Cloud: Top 100 Longest Words
Word clouds can also be used to visualize quantities other than frequency; here, word size reflects word length. As shown in this word cloud of the 100 longest words, the longest is “superfunkycalifragisexy”.
wc <- prince_word_lengths %>%
dplyr::ungroup() %>%
dplyr::select(word, word_length) %>%
dplyr::distinct() %>%
dplyr::arrange(desc(word_length))
# generate plot
wordcloud2::wordcloud2(wc[1:100, ],
size = .2,
minSize = .8,
ellipticity = .3,
fontFamily = "Arial Narrow",
rotateRatio = 1,
fontWeight = "bold",
shape = 'circle')
Lexical Diversity Over Time:
Lexical diversity describes how varied a vocabulary is. When it comes to music, we often expect a large number of unique words and therefore high lexical diversity. The plot below explores the number of unique words within each song over time and by chart level. We can see that the songs that made the Billboard Charts tend to have lower lexical diversity over time than those that did not, and that the songs with the highest lexical diversity are those that were never officially released.
word_summary <- prince_words_filtered %>%
dplyr::mutate(decade = ifelse(is.na(decade),"NONE", decade)) %>%
dplyr::group_by(decade, song) %>%
dplyr::mutate(word_count = n_distinct(word)) %>%
dplyr::select(song, Released = decade, Charted = charted, word_count) %>%
dplyr::distinct() %>% #To obtain one record per song
dplyr::ungroup()
# generate plot
yarrr::pirateplot(formula = word_count ~ Released + Charted, #Formula
data = word_summary, #Data frame
xlab = NULL, ylab = "Song Distinct Word Count", #Axis labels
main = "Lexical Diversity Per Decade", #Plot title
pal = "basel", #Color scheme
point.o = .2, #Points
avg.line.o = 1, #Turn on the Average/Mean line
theme = 0, #Theme
point.pch = 16, #Point `pch` type
point.cex = 1.5, #Point size
jitter.val = .1, #Turn on jitter to see the songs better
cex.lab = .9, cex.names = .7) #Axis label size
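A common single-number summary of lexical diversity is the type-token ratio: the number of distinct words divided by the total number of words. A minimal base-R sketch on a hypothetical token vector:

```r
# Type-token ratio: unique words / total words.
type_token_ratio <- function(tokens) length(unique(tokens)) / length(tokens)
type_token_ratio(c("love", "love", "time", "girl", "love"))  # 3 unique of 5 -> 0.6
```

The pirate plot above instead shows the raw distinct word count per song, which is easier to compare across decades when song lengths are similar.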
Term Frequency - Inverse Document Frequency (TF-IDF):
TF-IDF is a common transformation used to identify and quantify the most important words in a corpus. It adjusts the weight of each word according to how rarely it is used across the corpus, under the assumption that terms appearing in most of the documents carry less information and should be weighted less. A breakdown of each concept in TF-IDF is as follows:
* Term Frequency (TF): the number of times a word occurs within a document
* Document Frequency (DF): the number of documents that contain each word
* Inverse Document Frequency (IDF): ln(N / DF), where N is the total number of documents
* TF-IDF: TF*IDF
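These definitions can be checked by hand. The sketch below mirrors the formulas tidytext::bind_tf_idf uses (TF as a within-document proportion, IDF as the natural log of N/DF), computed in base R on two tiny hypothetical documents:

```r
# Hand-computed TF-IDF for the word "love" in document 1.
docs <- list(c("love", "love", "time", "girl"),
             c("time", "rain", "purple", "rain"))
tf  <- sum(docs[[1]] == "love") / length(docs[[1]])   # 2 / 4 = 0.5
df  <- sum(sapply(docs, function(d) "love" %in% d))   # appears in 1 of 2 documents
idf <- log(length(docs) / df)                         # ln(2 / 1)
tf * idf                                              # 0.5 * ln(2), about 0.347
```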
##TF-IDF
popular_tfidf_words <- prince_words_filtered %>%
# tidytext::unnest_tokens(word, lyrics) %>%
dplyr::distinct() %>%
dplyr::filter(!word %in% undesirable_words) %>%
dplyr::filter(nchar(word) > 3) %>%
dplyr::count(chart_level, word, sort = TRUE) %>%
dplyr::ungroup() %>%
tidytext::bind_tf_idf(word, chart_level, n)
As shown in the table below, the top 10 most frequently occurring words, like “love” and “time”, all have an IDF and TF-IDF of 0: because these words appear in every chart-level group, the IDF term is the natural log of 1, which is zero.
popular_tfidf_words[1:10,] %>%
dplyr::ungroup(chart_level, n) %>%
dplyr::mutate(n = formattable::color_tile("aquamarine","aquamarine")(n)) %>%
dplyr::mutate(tf_idf = formattable::color_tile("lightblue","lightblue")(tf_idf)) %>%
dplyr::mutate(chart_level = formattable::color_tile("plum","plum")(chart_level)) %>%
my_kable_styling(caption = "Top 10 Words by Raw Count with TF-IDF")
| chart_level | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| Uncharted | love | 386 | 0.0112899 | 0 | 0 |
| Uncharted | time | 348 | 0.0101784 | 0 | 0 |
| Uncharted | girl | 214 | 0.0062591 | 0 | 0 |
| Uncharted | night | 190 | 0.0055572 | 0 | 0 |
| Uncharted | mind | 161 | 0.0047090 | 0 | 0 |
| Uncharted | feel | 155 | 0.0045335 | 0 | 0 |
| Uncharted | play | 153 | 0.0044750 | 0 | 0 |
| Uncharted | body | 150 | 0.0043872 | 0 | 0 |
| Uncharted | hear | 145 | 0.0042410 | 0 | 0 |
| Uncharted | life | 144 | 0.0042118 | 0 | 0 |
TF-IDF: Important Words by Billboard Chart Level:
As shown in the faceted plots below, using TF-IDF changes the top 8 words by chart level. Looking at the top words for songs that reached the Top 10 on the Billboard Charts, no one should be surprised to see “purple” at the top.
top_popular_tfidf_words <- popular_tfidf_words %>%
dplyr::arrange(desc(tf_idf)) %>%
dplyr::mutate(word = factor(word, levels = rev(unique(word)))) %>%
dplyr::group_by(chart_level) %>%
dplyr::slice(seq_len(8)) %>%
dplyr::ungroup() %>%
dplyr::arrange(chart_level, tf_idf) %>%
dplyr::mutate(row = row_number())
top_popular_tfidf_words %>%
ggplot2::ggplot(aes(x = row, tf_idf, fill = chart_level)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::labs(x = NULL, y = "TF-IDF") +
ggplot2::ggtitle("Important Words using TF-IDF by Chart Level") +
theme_lyrics() +
ggplot2::facet_wrap(~chart_level, ncol = 3, scales = "free") +
ggplot2::scale_x_continuous( # This handles replacement of row
breaks = top_popular_tfidf_words$row, # notice need to reuse data frame
labels = top_popular_tfidf_words$word) +
ggplot2::coord_flip()
TF-IDF: Important Words by Decade:
Comparing this plot to the one without TF-IDF yields very different results. With this transformation, we can be more confident that we are capturing the most important words for each decade, because we are considering the frequency of each word both within and across all the songs we are looking at. These plots should have you thinking about what might have influenced Prince to write each of these songs; it is highly likely (as we will show later) that outside events had an effect on the appearance of these words.
tfidf_words_decade <- prince_words_filtered %>%
# tidytext::unnest_tokens(word, lyrics) %>%
dplyr::distinct() %>%
dplyr::filter(!word %in% undesirable_words & decade != 'NA') %>%
dplyr::filter(nchar(word) > 3) %>%
dplyr::count(decade, word, sort = TRUE) %>%
dplyr::ungroup() %>%
tidytext::bind_tf_idf(word, decade, n) %>%
dplyr::arrange(desc(tf_idf))
top_tfidf_words_decade <- tfidf_words_decade %>%
dplyr::group_by(decade) %>%
dplyr::slice(seq_len(8)) %>%
dplyr::ungroup() %>%
dplyr::arrange(decade, tf_idf) %>%
dplyr::mutate(row = row_number())
top_tfidf_words_decade %>%
ggplot2::ggplot(aes(x = row, tf_idf, fill = decade)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::labs(x = NULL, y = "TF-IDF") +
ggplot2::ggtitle("Important Words using TF-IDF by Decade") +
theme_lyrics() +
ggplot2::facet_wrap(~decade, ncol = 3, nrow = 2, scales = "free") +
ggplot2::scale_x_continuous( # this handles replacement of row
breaks = top_tfidf_words_decade$row, # notice need to reuse data frame
labels = top_tfidf_words_decade$word) +
ggplot2::coord_flip()
For this portion of the tutorial, we will be borrowing and expanding on a tutorial from Data Camp. Sentiment analysis is a text mining technique that aims to determine the underlying opinion or subjectivity of a corpus. In our case, we want to use sentiment analysis to better understand what may have motivated some of Prince’s most popular songs. Much like other text mining techniques, there are many different ways to approach this analysis. For this workshop, we will utilize three lexicons provided as part of the tidytext::sentiments data set. As described in the tutorial, the three lexicons are:
* AFINN: assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment
* Bing: assigns words into positive and negative categories
* NRC: assigns words into one or more of the following ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
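To make the lexicon idea concrete before joining against tidytext::sentiments, here is a toy AFINN-style scorer; the lexicon words and scores below are invented for illustration, not taken from the real AFINN list:

```r
# Sum per-word scores from a tiny made-up lexicon; words not in it contribute 0.
toy_lexicon <- c(love = 3, beautiful = 3, cry = -1, die = -3)
score_tokens <- function(tokens, lexicon) {
  sum(lexicon[tokens[tokens %in% names(lexicon)]])
}
score_tokens(c("do", "not", "cry", "one", "day", "all", "will", "die"), toy_lexicon)  # -4
```

The tidy approach below does the same matching with an inner join, which scales to thousands of lexicon entries and keeps the song-level metadata attached.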
new_sentiments <- tidytext::sentiments %>% #From the tidytext package
dplyr::filter(lexicon != "loughran") %>% #Remove the finance lexicon
dplyr::mutate(sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
ifelse(lexicon == "AFINN" & score < 0,"negative",
sentiment))) %>%
dplyr::group_by(lexicon) %>%
dplyr::mutate(words_in_lexicon = n_distinct(word)) %>%
dplyr::ungroup()
Matching Lexicons: It is important to determine how well each lexicon covers the words in our corpus. As shown in the table below, the NRC lexicon has the greatest number of words that also appear in Prince’s song lyrics. The table also provides information on the relative size of each sentiment lexicon. Note that we would not expect every word in a corpus to be included in a lexicon, especially since these lexicons were not created for song lyrics.
prince_words_filtered %>%
dplyr::mutate(words_in_lyrics = n_distinct(word)) %>%
dplyr::inner_join(new_sentiments) %>%
dplyr::group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
dplyr::summarise(lex_match_words = n_distinct(word)) %>%
dplyr::ungroup() %>%
dplyr::mutate(total_match_words = sum(lex_match_words), #Not used but good to have
match_ratio = lex_match_words / words_in_lyrics) %>%
dplyr::select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
dplyr::mutate(lex_match_words = formattable::color_bar("pink")(lex_match_words)) %>%
my_kable_styling(caption = "Lyrics Found In Lexicons")
| lexicon | lex_match_words | words_in_lyrics | match_ratio |
|---|---|---|---|
| AFINN | 743 | 7371 | 0.1008004 |
| bing | 1153 | 7371 | 0.1564238 |
| nrc | 1605 | 7371 | 0.2177452 |
Word Dependency:
As we found out earlier, love is an important concept in many of Prince’s songs. The table below shows the sentiment attached to the different forms of the word love in our lyrics corpus across the sentiment lexicons. It is interesting to see how coverage of the different forms of love varies: aside from AFINN, coverage by NRC and Bing is relatively even.
my_word_list <- data %>%
tidytext::unnest_tokens(word, lyrics) %>%
dplyr::filter(grepl("love", word)) %>%
dplyr::count(word) %>%
dplyr::select(myword = word, n) %>% #Rename word
dplyr::arrange(desc(n))
new_sentiments %>%
#Right join gets all words in `my_word_list` to show nulls
dplyr::right_join(my_word_list, by = c("word" = "myword")) %>%
dplyr::filter(word %in% my_word_list$myword) %>%
dplyr::mutate(instances = color_tile("pink", "pink")(n)) %>%
dplyr::select(-score, -n) %>% #Remove these fields
my_kable_styling(caption = "Dependency on Word Form")
| word | sentiment | lexicon | words_in_lexicon | instances |
|---|---|---|---|---|
| love | joy | nrc | 6468 | 1937 |
| love | positive | nrc | 6468 | 1937 |
| love | positive | bing | 6785 | 1937 |
| love | positive | AFINN | 2476 | 1937 |
| lover | anticipation | nrc | 6468 | 167 |
| lover | joy | nrc | 6468 | 167 |
| lover | positive | nrc | 6468 | 167 |
| lover | trust | nrc | 6468 | 167 |
| lover | positive | bing | 6785 | 167 |
| loves | positive | bing | 6785 | 37 |
| loved | positive | bing | 6785 | 31 |
| loved | positive | AFINN | 2476 | 31 |
| lovers | NA | NA | NA | 17 |
| lovesexy | NA | NA | NA | 16 |
| lovely | anticipation | nrc | 6468 | 11 |
| lovely | joy | nrc | 6468 | 11 |
| lovely | positive | nrc | 6468 | 11 |
| lovely | sadness | nrc | 6468 | 11 |
| lovely | surprise | nrc | 6468 | 11 |
| lovely | trust | nrc | 6468 | 11 |
| lovely | positive | bing | 6785 | 11 |
| lovely | positive | AFINN | 2476 | 11 |
| clover | NA | NA | NA | 8 |
| glove | NA | NA | NA | 6 |
| hardrocklover | NA | NA | NA | 6 |
| beloved | positive | bing | 6785 | 4 |
| beloved | positive | AFINN | 2476 | 4 |
| loveleft | NA | NA | NA | 4 |
| loveright | NA | NA | NA | 4 |
| love4oneanother | NA | NA | NA | 3 |
| anotherloverholenyohead | NA | NA | NA | 2 |
| gloved | NA | NA | NA | 1 |
| gloves | NA | NA | NA | 1 |
| lovelovelovelove | NA | NA | NA | 1 |
| lovemaking | joy | nrc | 6468 | 1 |
| lovemaking | positive | nrc | 6468 | 1 |
| lovemaking | trust | nrc | 6468 | 1 |
| uestlove | NA | NA | NA | 1 |
The Sentiment Behind the Lyrics:
By creating separate data sets for each lexicon’s mapping to the Prince lyrics, we can get a richer sense of what these words might mean. As shown in the NRC plot, the sentiment in Prince’s lyrics is primarily positive. This is very different from what we see with Bing, which is much more evenly split between positive and negative sentiment.
# create separate data sets for each sentiment lexicon
# bing
prince_bing <- prince_words_filtered %>%
dplyr::inner_join(get_sentiments("bing"))
# nrc
prince_nrc <- prince_words_filtered %>%
dplyr::inner_join(get_sentiments("nrc"))
# nrc sub
prince_nrc_sub <- prince_words_filtered %>%
dplyr::inner_join(get_sentiments("nrc")) %>%
dplyr::filter(!sentiment %in% c("positive", "negative"))
# generate plots - nrc
nrc_plot <- prince_nrc %>%
dplyr::group_by(sentiment) %>%
dplyr::summarise(word_count = n()) %>%
dplyr::ungroup() %>%
dplyr::mutate(sentiment = reorder(sentiment, word_count)) %>%
#Use `fill = -word_count` to make the larger bars darker
ggplot2::ggplot(aes(sentiment, word_count, fill = -word_count)) +
ggplot2::geom_col() + ggplot2::guides(fill = FALSE) + #Turn off the legend
theme_lyrics() + ggplot2::labs(x = NULL, y = "Word Count") +
ggplot2::scale_y_continuous(limits = c(0, 15000)) + #Hard code the axis limit
ggplot2::ggtitle("Prince NRC Sentiment") +
ggplot2::coord_flip()
# generate plots - bing
bing_plot <- prince_bing %>%
dplyr::group_by(sentiment) %>%
dplyr::summarise(word_count = n()) %>%
dplyr::ungroup() %>%
dplyr::mutate(sentiment = reorder(sentiment, word_count)) %>%
ggplot2::ggplot(aes(sentiment, word_count, fill = sentiment)) +
ggplot2::geom_col() +ggplot2::guides(fill = FALSE) +
theme_lyrics() + ggplot2::labs(x = NULL, y = "Word Count") +
ggplot2::scale_y_continuous(limits = c(0, 8000)) +
ggplot2::ggtitle("Prince Bing Sentiment") +
ggplot2::coord_flip()
gridExtra::grid.arrange(nrc_plot, bing_plot, nrow = 2)
The Sentiment Behind the Lyrics: Billboard Chart Level:
When we examine the changes in sentiment by Billboard Chart level, the charted songs appear slightly more positive in sentiment than the uncharted songs. We should keep in mind that, as shown above, the Bing lexicon has more negative than positive words.
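The imbalance in the Bing lexicon can be checked directly from the lexicon itself; here is a quick sketch using the copy bundled with tidytext:

```r
# Count positive vs. negative entries in the Bing lexicon itself
library(tidytext)
library(dplyr)

tidytext::get_sentiments("bing") %>%
  dplyr::count(sentiment)
# the lexicon contains substantially more negative than positive words
```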
prince_polarity_chart <- prince_bing %>%
dplyr::filter(chart_level != "NA") %>%
dplyr::count(sentiment, chart_level) %>%
tidyr::spread(sentiment, n, fill = 0) %>%
dplyr::mutate(polarity = positive - negative,
percent_positive = positive / (positive + negative) * 100)
#Polarity by chart
plot1 <- prince_polarity_chart %>%
ggplot2::ggplot( aes(chart_level, polarity, fill = chart_level)) +
ggplot2::geom_col() +
ggplot2::scale_fill_manual(values = my_colors[2:5]) +
ggplot2::geom_hline(yintercept = 0, color = "red") +
ggplot2::theme(plot.title = element_text(size = 11), legend.position="none") +
ggplot2::xlab(NULL) + ggplot2::ylab(NULL) +
ggplot2::ggtitle("Polarity By Chart Level")
#Percent positive by chart
plot2 <- prince_polarity_chart %>%
ggplot2::ggplot( aes(chart_level, percent_positive, fill = chart_level)) +
ggplot2::geom_col() +
ggplot2::scale_fill_manual(values = c(my_colors[2:5])) +
ggplot2::geom_hline(yintercept = 0, color = "red") +
ggplot2::theme(plot.title = element_text(size = 11), legend.position="none") +
ggplot2::xlab(NULL) + ggplot2::ylab(NULL) +
ggplot2::ggtitle("Percent Positive By Chart Level")
gridExtra::grid.arrange(plot1, plot2, ncol = 2)
The Sentiment Behind the Lyrics: The 90’s:
We can also examine the most frequently occurring words within each sentiment category of the NRC lexicon during the 90’s. It is interesting to see the overlap of words across sentiment categories. For example, “money” is tied to anger, anticipation, joy, surprise, and trust. In contrast, “words” is tied to the negative and anger categories.
plot_words_90s <- prince_nrc %>%
dplyr::filter(year %in% c("1990","1991", "1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999")) %>%
dplyr::group_by(sentiment) %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::arrange(desc(n)) %>%
dplyr::slice(seq_len(8)) %>% #consider top_n() from dplyr also
dplyr::ungroup()
# generate plot
plot_words_90s %>%
#Set `y = 1` to just plot one variable and use word as the label
ggplot2::ggplot(aes(word, 1, label = word, fill = sentiment )) +
#You want the words, not the points
ggplot2::geom_point(color = "transparent") +
#Make sure the labels don't overlap
ggrepel::geom_label_repel(force = 1,nudge_y = .5,
direction = "y",
box.padding = 0.04,
segment.color = "transparent",
size = 3) +
ggplot2::facet_grid(~sentiment) + theme_lyrics() +
ggplot2::theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
axis.title.x = element_text(size = 6),
panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect("lightgray", fill = NA),
strip.text.x = element_text(size = 8)) +
ggplot2::xlab(NULL) + ylab(NULL) +
ggplot2::ggtitle("90's NRC Sentiment") +
ggplot2::coord_flip()
The Sentiment Behind Purple Rain:
I find it really interesting to explore the sentiment behind the words of a single song. As shown below, I have investigated the sentiment behind the song “Purple Rain”. It’s clear from looking at the pattern of color blocks that the song has positive, joy, and trust tied to it. Having heard the song many times, that’s not surprising.
prince_words_filtered %>%
dplyr::filter(song %in% 'purple rain') %>%
dplyr::distinct(word) %>%
dplyr::inner_join(get_sentiments("nrc")) %>%
ggplot2::ggplot(aes(x = word, fill = sentiment)) +
ggplot2::facet_grid(~sentiment) +
ggplot2::geom_bar() + #Create a bar for each word per sentiment
theme_lyrics() +
ggplot2::theme(panel.grid.major.x = element_blank(),
axis.text.x = element_blank(),
strip.text.x= element_text(size = 8)) + #Place the words on the y-axis
ggplot2::xlab(NULL) +ggplot2::ylab(NULL) +
ggplot2::ggtitle("Purple Rain - Sentiment Words") +
ggplot2::coord_flip()
The Sentiment Behind Different Songs Over Time:
The plot below provides insight into the sentiment behind different songs at different times.
prince_nrc_sub %>%
dplyr::filter(song %in% c("so blue", "controversy", "raspberry beret",
"when doves cry", "the future", "1999")) %>%
dplyr::count(song, sentiment, year) %>%
dplyr::mutate(sentiment = reorder(sentiment, n), song = reorder(song, n)) %>%
ggplot2::ggplot(aes(sentiment, n, fill = sentiment)) +
ggplot2::geom_col() +
ggplot2::facet_wrap(year ~ song, scales = "free_x", labeller = label_both) +
theme_lyrics() +
ggplot2::theme(panel.grid.major.x = element_blank(),
axis.text.x = element_blank()) +
ggplot2::labs(x = NULL, y = NULL) +
ggplot2::ggtitle("NRC Sentiment Song Analysis") +
ggplot2::coord_flip()
The Sentiment Behind Bigrams:
Bigrams, or sequences of two words, are also an important structure for text mining. I won’t go into detail about them today, but because they are very interesting, I have provided a figure using them below. The figure provides an overview of the frequency of bigrams by decade.
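To see concretely what the tokenizer produces in bigram mode, here is a toy example on a made-up line (not the lyrics data):

```r
# Toy example: bigram tokenization produces overlapping word pairs
library(tidytext)

toy <- data.frame(line = 1,
                  text = "purple rain purple rain",
                  stringsAsFactors = FALSE)
tidytext::unnest_tokens(toy, bigram, text, token = "ngrams", n = 2)
# yields the bigrams "purple rain", "rain purple", "purple rain"
```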
prince_bigrams <- data %>%
tidytext::unnest_tokens(bigram, lyrics, token = "ngrams", n = 2)
bigrams_separated <- prince_bigrams %>%
tidyr::separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
dplyr::filter(!word1 %in% stop_words$word) %>%
dplyr::filter(!word2 %in% stop_words$word) %>%
dplyr::filter(!word1 %in% undesirable_words) %>%
dplyr::filter(!word2 %in% undesirable_words)
#Because there is so much repetition in music, also filter out the cases where the two words are the same
bigram_decade <- bigrams_filtered %>%
dplyr::filter(word1 != word2) %>%
dplyr::filter(decade != "NA") %>%
tidyr::unite(bigram, word1, word2, sep = " ") %>%
dplyr::inner_join(data) %>%
dplyr::count(bigram, decade, sort = TRUE) %>%
dplyr::group_by(decade) %>%
dplyr::slice(seq_len(7)) %>%
dplyr::ungroup() %>%
dplyr::arrange(decade, n) %>%
dplyr::mutate(row = row_number())
## Joining, by = c("song", "year", "album", "peak", "us_pop", "us_rnb", "decade", "chart_level", "charted")
bigram_decade %>%
ggplot2::ggplot(aes(row, n, fill = decade)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(~decade, scales = "free_y") +
ggplot2::xlab(NULL) + ggplot2::ylab(NULL) +
ggplot2::scale_x_continuous( # This handles replacement of row
breaks = bigram_decade$row, # Notice need to reuse data frame
labels = bigram_decade$bigram) +
theme_lyrics() +
ggplot2::theme(panel.grid.major.x = element_blank()) +
ggplot2::ggtitle("Bigrams By Decade") +
ggplot2::coord_flip()
Negation Bigram Networks:
Below, I have created a network to illustrate the word pair associations with negation words. This structure provides a very interesting overview of the different combinations of words that create negative bigrams.
AFINN <- tidytext::get_sentiments("afinn")
not_words <- bigrams_separated %>%
dplyr::filter(word1 == "not") %>%
dplyr::inner_join(AFINN, by = c(word2 = "word")) %>%
dplyr::count(word2, score, sort = TRUE) %>%
dplyr::ungroup()
negation_words <- c("not", "no", "never", "without")
negation_bigrams <- bigrams_separated %>%
dplyr::filter(word1 %in% negation_words) %>%
dplyr::inner_join(AFINN, by = c(word2 = "word")) %>%
dplyr::count(word1, word2, score, sort = TRUE) %>%
dplyr::mutate(contribution = n * score) %>%
dplyr::arrange(desc(abs(contribution))) %>%
dplyr::group_by(word1) %>%
dplyr::slice(seq_len(20)) %>%
dplyr::arrange(word1,desc(contribution)) %>%
dplyr::ungroup()
bigram_graph <- negation_bigrams %>% igraph::graph_from_data_frame()
set.seed(123)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph::ggraph(bigram_graph, layout = "fr") +
ggraph::geom_edge_link(alpha = .25) +
ggraph::geom_edge_density(aes(fill = score)) +
ggraph::geom_node_point(color = "purple1", size = 1) + #Purple for Prince!
ggraph::geom_node_text(aes(label = name), repel = TRUE) +
ggplot2::theme_void() + ggplot2::theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
ggplot2::ggtitle("Negation Bigram Network")
Classification of text is an important task in text mining. There are two broad approaches we can use: supervised (where we have group labels) and unsupervised (where we have no labels or groupings). Here, I will provide a very brief example of each. To provide you with additional tools, I have included new code that uses the R quanteda library. This part of the tutorial comes from Ken Benoit.
# create a document term matrix and clean data
corpus <- quanteda::corpus(data$lyrics, docvars=data[,2:10])
dtm <- quanteda::dfm(corpus, tolower = TRUE, stem = TRUE, remove_punct = TRUE, remove = c(tidytext::stop_words$word, quanteda::stopwords("english")))
# filtering out words occurring less than 5 times
doc_freq <- quanteda::docfreq(dtm)
dtm <- dtm[, doc_freq >= 5]
# normalize with tf-idf
dtm <- quanteda::dfm_weight(dtm, "tfidf")
dtm
## Document-feature matrix of: 824 documents, 1,891 features (96.1% sparse).
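As a sanity check on what the tf-idf weighting step is doing, here is a toy two-document example, assuming quanteda’s default inverse document frequency scheme (log base 10); `dfm_tfidf()` is the newer spelling of `dfm_weight(x, "tfidf")`:

```r
# Toy illustration of tf-idf: each count is scaled by log10(N / df), so a
# word appearing in every document is down-weighted to zero
library(quanteda)

toy <- quanteda::dfm(c(d1 = "love love rain", d2 = "love guitar"))
quanteda::docfreq(toy)    # love: 2 docs; rain and guitar: 1 doc each
quanteda::dfm_tfidf(toy)  # "love" becomes 0; "rain" in d1 is 1 * log10(2/1)
```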
For the supervised example I will train a binary classifier (Charted vs. Uncharted) using Naïve Bayes via R quanteda::textmodel_nb. As explained in the quanteda documentation, “Naive Bayes is a supervised model usually used to classify documents into two or more categories. We train the classifier using class labels attached to documents, and predict the most likely class(es) of new unlabeled documents”. As the contingency table shows, overall accuracy looks reasonable at 82%, with a specificity of 89%. That being said, we have very low precision (13%) and recall (16%), so the model does little better than always predicting the majority class.
## sample data
train_dtm <- quanteda::dfm_sample(dtm, size = nrow(dtm)*.75)
test_dtm <- dtm[dplyr::setdiff(quanteda::docnames(dtm), quanteda::docnames(train_dtm)), ]
# train model
nb_model <- quanteda::textmodel_nb(train_dtm, y = quanteda::docvars(train_dtm, "charted"))
pred_nb <- stats::predict(nb_model, newdata = test_dtm)
# view predictions
caret::confusionMatrix(table(prediction = pred_nb$nb.predicted,
charted = quanteda::docvars(test_dtm, "charted")),
mode = "everything")
## Confusion Matrix and Statistics
##
## charted
## prediction Charted Uncharted
## Charted 3 21
## Uncharted 16 166
##
## Accuracy : 0.8204
## 95% CI : (0.761, 0.8703)
## No Information Rate : 0.9078
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.0408
## Mcnemar's Test P-Value : 0.5108
##
## Sensitivity : 0.15789
## Specificity : 0.88770
## Pos Pred Value : 0.12500
## Neg Pred Value : 0.91209
## Precision : 0.12500
## Recall : 0.15789
## F1 : 0.13953
## Prevalence : 0.09223
## Detection Rate : 0.01456
## Detection Prevalence : 0.11650
## Balanced Accuracy : 0.52280
##
## 'Positive' Class : Charted
##
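The headline numbers above come straight from the 2x2 table. As a sanity check, here is how they are derived from the four cell counts, with “Charted” as the positive class:

```r
# Recompute the headline metrics from the confusion matrix cell counts
tp <- 3    # predicted Charted,   actually Charted
fp <- 21   # predicted Charted,   actually Uncharted
fn <- 16   # predicted Uncharted, actually Charted
tn <- 166  # predicted Uncharted, actually Uncharted

accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # 169/206 ~ 0.8204
precision   <- tp / (tp + fp)                   # 3/24    = 0.1250
recall      <- tp / (tp + fn)                   # 3/19    ~ 0.1579
specificity <- tn / (tn + fp)                   # 166/187 ~ 0.8877
f1 <- 2 * precision * recall / (precision + recall)  # ~ 0.1395

round(c(accuracy, precision, recall, specificity, f1), 4)
```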
For the unsupervised example I will use latent Dirichlet allocation (LDA), “a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar” (Wikipedia). The results of this analysis are shown in the table below: the top 10 words for each of the 5 topics we modeled. What is clear from these results is that more pre-processing of the data is likely needed before better topics emerge.
texts = quanteda::corpus_reshape(corpus, to = "paragraphs")
par_dtm <- quanteda::dfm(texts, stem = TRUE, remove_punct = TRUE, remove = c(tidytext::stop_words$word, quanteda::stopwords("english")))
par_dtm <- quanteda::dfm_trim(par_dtm, min_count = 5)
par_dtm <- quanteda::convert(par_dtm, to = "topicmodels")
# set seed and run model
set.seed(1)
lda_model <- topicmodels::LDA(par_dtm, method = "Gibbs", k = 5)
term <- terms(lda_model, 10)
term %>%
kableExtra::kable("html", escape = FALSE, align = "c", caption = "Top 10 Words by Topic") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
full_width = FALSE)
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| u | babi | will | get | come |
| 2 | can | love | got | go |
| 4 | like | one | everybodi | let |
| la | love | day | parti | get |
| oh | wanna | life | like | got |
| princ | ooh | never | ai | yeah |
| chorus | want | know | rock | now |
| girl | know | just | yeah | can |
| yeah | oh | eye | da | feel |
| ha | just | heart | man | stop |
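If you prefer to stay in the tidy framework, the fitted model can also be inspected with tidytext’s `tidy()` method for LDA objects; this is a sketch assuming the `lda_model` object fit above:

```r
# Sketch: inspect per-topic word probabilities (beta) in the tidy framework,
# assuming the `lda_model` object fit above
library(tidytext)
library(dplyr)
library(ggplot2)

top_terms <- tidytext::tidy(lda_model, matrix = "beta") %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(10, beta) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)

top_terms %>%
  dplyr::mutate(term = reorder(term, beta)) %>%
  ggplot2::ggplot(aes(term, beta, fill = factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(~topic, scales = "free_y") +
  ggplot2::coord_flip()
```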
The remainder of this tutorial is meant to provide you with time to explore biomedical data. We are providing you with data from the Computational Medicine Center’s 2007 Medical Natural Language Processing Challenge. In your home directory, within the “Data” folder, you will find the following items:
* 2007ChallengeDescription.pdf
* 2007ChallengeTrainData.xml
* 2007ChallengeTrainSchema.ng
* README
These documents provide the information needed to understand the goal of the challenge as well as the data. To get you started, I have included a transformed version of the XML data as an R data frame. The data will load as a data frame called “clin_data”. If this is not satisfactory, I have also included the same data as a csv called “transformed_2007_challenge_data.csv”.
# read in data
load("Data/transformed_2007_challenge_data.rda")
# take a peek at the first row data
dplyr::glimpse(clin_data[1,])
## Observations: 1
## Variables: 8
## $ doc_id <int> 97634946
## $ type <chr> "RADIOLOGY_REPORT"
## $ COMPANY1 <chr> "514, 786.07"
## $ COMPANY2 <chr> "79.99, 493.9"
## $ COMPANY3 <chr> "786.07"
## $ MAJORITY <chr> "786.07"
## $ CLINICAL_HISTORY <chr> "Seven month old with wheezing, congestion."
## $ IMPRESSION <chr> "Findings consistent with viral or reactive a...
# number of rows and columns
dim(data)
## [1] 824 10
Exploring Clinical Notes:
Below I provide some simple code that begins to explore this data. I encourage you to use what we learned today, in addition to other resources to dive deep into this data and see what you can discover!
# clinical history
words_filtered_ch <- clin_data %>%
tidytext::unnest_tokens(word, CLINICAL_HISTORY) %>%
dplyr::anti_join(tidytext::stop_words) %>%
dplyr::distinct() %>%
dplyr::filter(!word %in% undesirable_words) %>%
dplyr::filter(nchar(word) > 3)
# Top words
ch <- words_filtered_ch %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::top_n(10) %>%
dplyr::ungroup() %>%
dplyr::mutate(word = reorder(word, n)) %>%
ggplot() +
ggplot2::geom_col(aes(word, n), fill = my_colors[4]) +
ggplot2::theme(legend.position = "none",
plot.title = ggplot2::element_text(hjust = 0.5),
panel.grid.major = ggplot2::element_blank()) +
ggplot2::xlab("") +
ggplot2::ylab("Report Count") +
ggplot2::ggtitle("Most Frequently Used Words in Clinical History Reports") +
ggplot2::coord_flip()
# impression
words_filtered_imp <- clin_data %>%
tidytext::unnest_tokens(word, IMPRESSION) %>%
dplyr::anti_join(tidytext::stop_words) %>%
dplyr::distinct() %>%
dplyr::filter(!word %in% undesirable_words) %>%
dplyr::filter(nchar(word) > 3)
imp <- words_filtered_imp %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::top_n(10) %>%
dplyr::ungroup() %>%
dplyr::mutate(word = reorder(word, n)) %>%
ggplot() +
ggplot2::geom_col(aes(word, n), fill = my_colors[4]) +
ggplot2::theme(legend.position = "none",
plot.title = ggplot2::element_text(hjust = 0.5),
panel.grid.major = ggplot2::element_blank()) +
ggplot2::xlab("") +
ggplot2::ylab("Report Count") +
ggplot2::ggtitle("Most Frequently Used Words in Impression Reports") +
ggplot2::coord_flip()
gridExtra::grid.arrange(ch, imp, nrow = 2)
words_counts <- words_filtered_ch %>% dplyr::count(word, sort = TRUE)
wordcloud2::wordcloud2(words_counts[1:500, ],
size = .8,
shape = 'circle',
fontFamily = "Arial Narrow")
words_counts <- words_filtered_imp %>% dplyr::count(word, sort = TRUE)
wordcloud2::wordcloud2(words_counts[1:500, ],
size = .8,
shape = 'circle',
fontFamily = "Arial Narrow")
For additional tutorials, please see the following:
* Welcome to Text Mining with R
* Data Camp - Tutorial 1
* Data Camp - Tutorial 2
* Data Camp - Sentiment Analysis Tutorials
* Quanteda - Naïve Bayes Tutorial
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bindrcpp_0.2.2 yarrr_0.1.5 BayesFactor_0.9.12-4.1
## [4] coda_0.19-1 jpeg_0.1-8 wordcloud2_0.2.1
## [7] widyr_0.1.1 topicmodels_0.2-7 tm_0.7-3
## [10] NLP_0.1-11 tidytext_0.1.8 tidyr_0.8.0
## [13] SnowballC_0.5.1 spacyr_0.9.9 RTextTools_1.4.2
## [16] SparseM_1.77 RCurl_1.95-4.10 bitops_1.0-6
## [19] quanteda_1.2.0 knitr_1.20 kableExtra_0.8.0
## [22] igraph_1.2.1 formattable_0.2.0.1 gridExtra_2.3
## [25] ggrepel_0.8.0 ggraph_1.0.1 easyGgplot2_1.0.0.9000
## [28] e1071_1.6-8 doMC_1.3.5 iterators_1.0.9
## [31] foreach_1.4.4 dplyr_0.7.4 corpustools_0.3.3
## [34] data.table_1.11.2 Matrix_1.2-14 caret_6.0-79
## [37] ggplot2_2.2.1 lattice_0.20-35 circlize_0.4.3
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.2 fastmatch_1.1-0 maxent_1.3.3.1
## [4] plyr_1.8.4 lazyeval_0.2.1 splines_3.5.0
## [7] digest_0.6.15 htmltools_0.3.6 viridis_0.5.1
## [10] magrittr_1.5 sfsmisc_1.1-2 recipes_0.1.2
## [13] readr_1.1.1 gower_0.1.2 RcppParallel_4.4.0
## [16] dimRed_0.1.0 colorspace_1.3-2 rvest_0.3.2
## [19] bindr_0.1.1 survival_2.42-3 glue_1.2.0
## [22] DRR_0.0.3 stopwords_0.9.0 gtable_0.2.0
## [25] ipred_0.9-6 MatrixModels_0.4-1 kernlab_0.9-26
## [28] ddalpha_1.3.3 shape_1.4.4 DEoptimR_1.0-8
## [31] abind_1.4-5 scales_0.5.0 mvtnorm_1.0-7
## [34] Rcpp_0.12.16 viridisLite_0.3.0 magic_1.5-8
## [37] units_0.5-1 foreign_0.8-70 stats4_3.5.0
## [40] lava_1.6.1 prodlim_2018.04.18 glmnet_2.0-16
## [43] htmlwidgets_1.2 httr_1.3.1 modeltools_0.2-21
## [46] pkgconfig_2.0.1 nnet_7.3-12 labeling_0.3
## [49] tidyselect_0.2.4 rlang_0.2.0 reshape2_1.4.3
## [52] munsell_0.4.3 tools_3.5.0 broom_0.4.4
## [55] evaluate_0.10.1 geometry_0.3-6 stringr_1.3.1
## [58] yaml_2.1.19 tree_1.0-39 ModelMetrics_1.1.0
## [61] robustbase_0.93-0 caTools_1.17.1 purrr_0.2.4
## [64] randomForest_4.6-14 pbapply_1.3-4 nlme_3.1-137
## [67] slam_0.1-43 RcppRoll_0.2.2 tau_0.0-20
## [70] xml2_1.2.0 tokenizers_0.2.1 compiler_3.5.0
## [73] rstudioapi_0.7 tibble_1.4.2 tweenr_0.1.5
## [76] stringi_1.2.2 psych_1.8.4 pillar_1.2.2
## [79] GlobalOptions_0.0.13 R6_2.2.2 janeaustenr_0.1.5
## [82] codetools_0.2-15 MASS_7.3-50 gtools_3.5.0
## [85] assertthat_0.2.0 CVST_0.2-1 rprojroot_1.3-2
## [88] withr_2.1.2 mnormt_1.5-5 hms_0.4.2
## [91] ISOcodes_2017.09.27 udunits2_0.13 grid_3.5.0
## [94] rpart_4.1-13 timeDate_3043.102 class_7.3-14
## [97] rmarkdown_1.9 ggforce_0.1.1 lubridate_1.7.4